WEEK 9: LINEAR REGRESSION & SIMULATING DISTRIBUTIONS

Monday, March 6th

Today we will…

  • Questions from Lab/Challenge 8
  • Final Project Part 2 – Linear Regression
  • Mini lecture on text material
    • Review of Simple Linear Regression
    • Conditions of Linear Regression
  • PA 9.1: Mystery Animal
  • Lab 9: Baby Names

NC Births Data

The ncbirths data set is a random sample of 1,000 cases taken from a larger data set collected in North Carolina in 2004.

Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age).

library(openintro)
data(ncbirths)
head(ncbirths) |> 
  knitr::kable() |> 
  kableExtra::scroll_box(height = "200px")
| fage | mage | mature      | weeks | premie    | visits | marital     | gained | weight | lowbirthweight | gender | habit     | whitemom  |
|------|------|-------------|-------|-----------|--------|-------------|--------|--------|----------------|--------|-----------|-----------|
| NA   | 13   | younger mom | 39    | full term | 10     | not married | 38     | 7.63   | not low        | male   | nonsmoker | not white |
| NA   | 14   | younger mom | 42    | full term | 15     | not married | 20     | 7.88   | not low        | male   | nonsmoker | not white |
| 19   | 15   | younger mom | 37    | full term | 11     | not married | 38     | 6.63   | not low        | female | nonsmoker | white     |
| 21   | 15   | younger mom | 41    | full term | 6      | not married | 34     | 8.00   | not low        | male   | nonsmoker | white     |
| NA   | 15   | younger mom | 39    | full term | 9      | not married | 27     | 6.38   | not low        | female | nonsmoker | not white |
| NA   | 15   | younger mom | 38    | full term | 19     | not married | 22     | 5.38   | low            | male   | nonsmoker | not white |

Relationships Between Variables

In a statistical model, we generally have one variable that is the output and one or more variables that are the inputs.

  • Response variable
    • a.k.a. \(y\), dependent
    • The quantity you want to understand
  • Explanatory variable
    • a.k.a. \(x\), independent, explanatory, predictor
    • Something you think might be related to the response

Visualizing the Relationship

The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”

Characterizing relationships

  • Form – linear, quadratic, non-linear?

  • Direction – positive, negative?

  • Strength – how much scatter/noise?

  • Unusual observations – do points not fit the overall pattern?

Your turn!

How would you characterize this relationship?

  • shape
  • direction
  • strength
  • outliers

Simple Linear Regression (SLR)

Assume the relationship between \(x\) and \(y\) takes the form of a linear function.

\[ response = intercept + slope \cdot explanatory + noise \]

Population Regression Model

\(Y_i = \beta_0 + \beta_1 \cdot X_i + \epsilon_i\)

\(\epsilon_i \sim N(0, \sigma)\)

Fitted Regression Model (Sample)

\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x\)
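The fitted coefficients come from least squares: \(\hat{\beta}_1 = \dfrac{cov(x, y)}{var(x)}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\). A minimal sketch with made-up toy vectors (not the ncbirths values) confirms that `lm()` produces exactly these estimates:

```r
# Least squares by hand: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x).
# Hypothetical toy data for illustration:
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b1 <- cov(x, y) / var(x)        # estimated slope
b0 <- mean(y) - b1 * mean(x)    # estimated intercept

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))  # TRUE -- lm() agrees
```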

Fitting an SLR Model

ncbirths |> 
ggplot(aes(x = gained, 
           y = weight)
       ) +
  geom_jitter() + 
  geom_smooth(method = "lm") 

ncbirth_lm <- lm(weight ~ gained, 
                 data = ncbirths
                 )
library(broom)
tidy(ncbirth_lm)
# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   6.63     0.111       59.6  0         
2 gained        0.0161   0.00332      4.86 0.00000135
  • Intercept: Expected mean value for \(y\), when \(x\) is 0

  • Slope: Expected change in the mean of \(y\), when \(x\) is increased by 1 unit
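These interpretations can be turned into a prediction by plugging a value of \(x\) into the fitted line. A quick sketch using the (rounded) estimates printed above; with the model object itself you would use `predict(ncbirth_lm, newdata = data.frame(gained = 30))`:

```r
# Predicted mean birth weight (lbs) for a mother who gained 30 lbs,
# using the rounded coefficients from the tidy() output above:
b0 <- 6.63     # intercept estimate
b1 <- 0.0161   # slope estimate

b0 + b1 * 30   # -> 7.113
```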

Residuals: the difference between the observed value (point) and the expected value (line).

ncbirth_lm$residuals |> 
  head(3)
         1          2          3 
 0.3866279  0.9271567 -0.6133721 
resid(ncbirth_lm) |> 
  head(3)
         1          2          3 
 0.3866279  0.9271567 -0.6133721 
library(broom)
augment(ncbirth_lm) |> 
  head(3)
# A tibble: 3 × 9
  .rownames weight gained .fitted .resid    .hat .sigma   .cooksd .std.resid
  <chr>      <dbl>  <int>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
1 1           7.63     38    7.24  0.387 0.00133   1.47 0.0000458      0.262
2 2           7.88     20    6.95  0.927 0.00157   1.47 0.000311       0.630
3 3           6.63     38    7.24 -0.613 0.00133   1.47 0.000115      -0.416

Model Diagnostics

Linear Relationship

  • Almost nothing you explore will look perfectly linear

  • Be careful with relationships that have curvature

  • Variable transformations can often help

Independence of observations

Nearly every statistical model you will encounter requires the observations in the data to be independent.

This is often the most crucial but also the most difficult to diagnose.

What does independence mean?

Independence means that I should not be able to know the \(y\) value for one observation based on knowing the \(y\) value of another observation.

Situations with independence violations

If there is an inherent grouping of observations, then independence may be violated.

  • Stock market prices over time
  • Geographical similarities
  • Biological conditions of family members
  • Repeated observations

Normally distributed residuals

Observations vary symmetrically around the least squares line, spreading out in a bell-shaped fashion.

Less important than linearity or independence for a few reasons:

  • Least squares is an unbiased estimate of the true population model.
  • Larger sample sizes make the model more reliable.

Equal (constant) variance

  • The variability of points around the least squares line remains roughly constant.

  • When the variance is not constant across the range of \(x\)-values, the model can seriously mis-estimate the variability of the slope.

ncbirth_lm |> 
  augment() |> 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point()

Assessing Model Fit

anova(ncbirth_lm)
Analysis of Variance Table

Response: weight
           Df  Sum Sq Mean Sq F value    Pr(>F)    
gained      1   51.36  51.357  23.642 1.353e-06 ***
Residuals 971 2109.32   2.172                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(ncbirth_lm)

Call:
lm(formula = weight ~ gained, data = ncbirths)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4085 -0.6950  0.1643  0.9222  4.5158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.63003    0.11120  59.620  < 2e-16 ***
gained       0.01614    0.00332   4.862 1.35e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.474 on 971 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.02377,   Adjusted R-squared:  0.02276 
F-statistic: 23.64 on 1 and 971 DF,  p-value: 1.353e-06
  • Sum of Square Errors (SSE)
    • sum of squared residuals
  • Root Mean Square Error (RMSE)
    • standard deviation of residuals
  • R-squared
    • proportion of variability in response accounted for by the linear model
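Each of these fit summaries can be computed directly from the residuals. A sketch using the built-in `mtcars` data as a stand-in for ncbirths (so the block is self-contained), checked against `summary()`:

```r
# SSE, RMSE, and R-squared by hand, using mtcars as a stand-in dataset.
fit <- lm(mpg ~ wt, data = mtcars)

sse  <- sum(resid(fit)^2)              # sum of squared residuals
rmse <- sqrt(sse / df.residual(fit))   # residual standard error (n - 2 df)
r2   <- 1 - sse / sum((mtcars$mpg - mean(mtcars$mpg))^2)

all.equal(rmse, summary(fit)$sigma)      # TRUE
all.equal(r2,   summary(fit)$r.squared)  # TRUE
```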

Model Comparison

Weight as explanatory variable

weight_weeks <- lm(weight ~ weeks, 
                   data = ncbirths
                   )
  • SSE = 1246.55
  • RMSE = 1.119
  • \(R^2\) = 0.449

Visit as explanatory variable

weight_visits <- lm(weight ~ visits, 
                    data = ncbirths
                    )
  • SSE = 2152.74
  • RMSE = 1.475
  • \(R^2\) = 0.01819

Extending to Multiple Linear Regression

If you have multiple explanatory variables you want to include in the model:

lm(y ~ x1 + x2 + x3 + ... + xn)

If you want to include interactions:

lm(y ~ x1 + x2 + x1:x2)

or use the shortcut:

lm(y ~ x1*x2)

You can include quadratic relationships (recall \(ax^2 + bx + c\))

lm(y ~ I(x1^2) + x1)
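A quick check that the `*` shortcut really expands to the spelled-out formula, again using `mtcars` as a hypothetical stand-in (the variables `wt` and `hp` are not from the slides' data):

```r
# x1*x2 is shorthand for x1 + x2 + x1:x2 -- the fits are identical.
shortcut <- lm(mpg ~ wt * hp, data = mtcars)
spelled  <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
all.equal(coef(shortcut), coef(spelled))  # TRUE

# A quadratic term needs I() so ^ is treated as arithmetic,
# not formula syntax:
quad <- lm(mpg ~ I(wt^2) + wt, data = mtcars)
```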

A quick note about piping |> and _

With the native pipe |>, use _ as the placeholder for the piped-in data (the magrittr pipe %>% uses . instead). Note that _ must be supplied to a named argument.

weight_fullterm <- ncbirths |> 
  filter(premie == "full term") |> 
  lm(weight ~ weeks, 
     data = _
     )

Worktime

To do…

  • PA 9.1: Mystery Animal
    • Due Wednesday, 3/8 at 8:00am
  • Lab 9: Baby Names
    • Due Friday, 3/10 at 11:59pm
  • Challenge 9: Formatting Nice Tables
    • Due Saturday, 3/11 at 11:59pm
  • Project Check-point 2: Linear Regression
    • Due Sunday, 3/12 at 11:59pm

Wednesday, March 8th

Today we will…

  • Mini lecture on text material
    • Simulating Distributions
  • Worktime
    • PA 9.2: Instrument Con
    • Lab & Challenge 9
    • Project: Linear Regression

Statistical Distributions

Recall from your previous statistics classes…

A random variable is a value we don’t know until we take a sample.

  • Coin flip: Could be heads (0) or tails (1)
  • Person’s height: Could be anything from 0 feet to 20 feet.
  • Annual income of a US worker: $0 to $1.6 billion

The distribution of the random variable tells us the possible values and how likely they are

  • Coin flip has 50% chance of heads, 50% chance of tails
  • Heights follow a bell curve shape centered around 5 feet 7 inches.
  • Most American workers make under $100,000

Some distributions have names

For this class, you need to know a few of them.

Uniform Distribution

When you know the range, but not much else.

All values in the range are equally likely to occur.

Normal Distribution

When you expect values to fall near the center.

Frequency of values follows a bell curve shape.

t-Distribution

Slightly wider bell curve shape.

Used in the same contexts as the Normal distribution, but more often with real data (when the population standard deviation is unknown).

Chi-Square Distribution

Somewhat skewed, and only allows values above zero.

Used in testing count data.

Binomial Distribution

Appears when you have two possible outcomes, and you are counting how many times each outcome occurred.

(Common example: voting data)

Statistical Distributions

Things you might want to use a Statistical distribution for:

  • Find the probability of an event

    • If I flip 10 coins, what are the chances I get all heads?
  • Estimate a percentage of a population:

    • About what percent of people are above 6 feet tall?
  • Quantify the evidence in your data:

    • In my survey of 100 people, 67 said they were voting for Measure A. How confident am I that Measure A will pass?
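The first and third bullets can be answered directly with the binomial functions. A sketch, taking a fair coin for the flips and a 50/50 null for the survey:

```r
# P(all 10 heads) for a fair coin -- same as 0.5^10:
dbinom(x = 10, size = 10, prob = 0.5)  # -> 0.0009765625

# If support were really 50/50, how often would 67 or more of
# 100 respondents say "yes"? A tiny probability, so 67/100 is
# strong evidence that Measure A will pass.
pbinom(q = 66, size = 100, prob = 0.5, lower.tail = FALSE)
```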

Statistical Distributions in R – r, p, d, q

  • r stands for random sampling: generate a random value from the distribution

  • We typically use this to simulate data; that is, to generate pretend observations to see what happens.

runif(n = 3)
[1] 0.6074404 0.5133956 0.6360152
runif(n = 3, min = 10, max = 20)
[1] 18.42964 19.71224 14.78626
rnorm(n = 3)
[1]  1.1640752 -1.7659451  0.3913231
rnorm(n = 3, mean = -100, sd = 50)
[1]  -76.55332  -42.76307 -130.96689
rt(n = 3, df = 11)
[1] 0.5711975 1.2991483 0.2861644
rbinom(n = 3, size = 10, prob = 0.7)
[1] 5 9 8
rchisq(n = 3, df = 11)
[1] 6.827248 7.510705 2.871458
  • p stands for probability: compute the chance of an observation less than x, i.e. \(P(X < x)\)

  • We usually use this for calculating p-values

pnorm(q = 1.5)
[1] 0.9331928
pnorm(q = 70, mean = 67, sd = 3)
[1] 0.8413447
1 - pnorm(q = 70, mean = 67, sd = 3)
[1] 0.1586553
pnorm(q = 70, mean = 67, sd = 3, lower.tail = FALSE)
[1] 0.1586553
  • q stands for quantile: given a probability p, compute the x-value such that \[P(X < x) = p.\]

  • the q functions are the “backwards” version of the p functions

qnorm(p = 0.95)
[1] 1.644854
qnorm(p = 0.95, mean = 67, sd = 3)
[1] 71.93456
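Since the q functions invert the p functions, composing them should return the original input. A quick sanity check:

```r
# q and p undo each other:
pnorm(qnorm(0.95))  # -> 0.95
qnorm(pnorm(1.5))   # -> 1.5
```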

d stands for density: compute the height of distribution curve

  • Discrete distributions: probability of getting that exact value

  • Continuous distributions: the height of the density curve, which is not itself a probability (so rarely useful on its own)

What is the probability of getting exactly 12 heads in 20 coin tosses, for a coin with a 70% chance of tails?

dbinom(x = 12, size = 20, prob = 0.3)
[1] 0.003859282

Simulating Fake Data

Since there is randomness involved, if we want to create a reproducible random sample, we “set the seed” so that we get the same random sample each time the code is run.

set.seed(94301)
fake_data <- tibble(
  names   = charlatan::ch_name(1000),
  height  = rnorm(1000, mean = 67, sd = 3),
  age     = runif(1000, min = 15, max = 75),
  measure = rbinom(1000, size = 1, prob = 0.6)
) |> 
  mutate(
    supports_measure_A = ifelse(measure == 1, "yes", "no")
  )

head(fake_data)
# A tibble: 6 × 5
  names                    height   age measure supports_measure_A
  <chr>                     <dbl> <dbl>   <int> <chr>             
1 Dr. Dagmar Johns           65.9  31.7       0 no                
2 Freddy Donnelly            66.5  24.2       1 yes               
3 Mr. Fielding Hackett DDS   71.0  62.9       1 yes               
4 Frankie King               64.3  63.0       1 yes               
5 Jaslyn Hickle              68.5  66.2       1 yes               
6 Dr. Jessee Brekke DDS      71.7  50.0       1 yes               
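A minimal demonstration of what `set.seed()` buys us: resetting to the same seed reproduces the “random” draws exactly (the seed value here is arbitrary):

```r
# Same seed, same draws:
set.seed(94301)
draw1 <- rnorm(5)

set.seed(94301)
draw2 <- rnorm(5)

identical(draw1, draw2)  # TRUE
```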
Code
fake_data |> 
  ggplot(aes(y = supports_measure_A, 
             x = age,
             fill = supports_measure_A)
         ) +
  ggridges::geom_density_ridges(show.legend = F) +
  scale_fill_brewer(palette = "Paired") +
  theme_bw() +
  labs(x = "Age (years)",
       y = "",
       subtitle = "Support for Measure A",
       )

Plotting distributions

fake_data |> 
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, fill = "grey", color = "white") +
  theme_bw()

fake_data |> 
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, fill = "grey", color = "white") +
  stat_function(fun = dnorm, 
                col = "steelblue", 
                lwd = 2
                )

fake_data |> 
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, fill = "grey", color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3), 
                color = "steelblue", 
                lwd = 2
                )

fake_data |> 
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = ..density..),
                 bins = 10, fill = "grey", color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3), 
                color = "steelblue", 
                lwd = 2
                )

Empirical vs Theoretical Distributions

Code
fake_data %>%
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = ..density..), bins = 10) 

Code
fake_data %>%
  ggplot(aes(x = height)) +
  stat_function(fun = ~dnorm(.x, mean = 67, sd = 3), 
                col = "steelblue", lwd = 2) +
  stat_function(fun = ~dnorm(.x, mean = 67, sd = 2), 
                col = "orange2", lwd = 2)

In-line code `r `

When writing up a report that includes results from a random-generation process, use inline code (`r `) to include your output within your written description, so the document stays reproducible.

my_rand <- rnorm(1, mean = 0, sd = 1)
my_rand
[1] 0.4240747

My random number is 0.4240747.
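In the Quarto/R Markdown source, that sentence would be written with an inline chunk, so the printed value always matches the seed-controlled draw (a sketch of the document source, not its rendered output):

```markdown
My random number is `r my_rand`.
```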

To do…

  • PA 9.2: Instrument Con
    • Due Friday, 3/10 at 8:00am
  • Lab 9: Baby Names
    • Due Friday, 3/10 at 11:59pm
  • Challenge 9: Formatting Nice Tables
    • Due Saturday, 3/11 at 11:59pm
  • Project Check-point 2: Linear Regression
    • Due Sunday, 3/12 at 11:59pm
  • Read Chapter 10: Predictive Checks and complete Check-in 10.1
    • Due Monday, 3/13 at 8:00am